geopandas to work with vectorsrasterio to work with rastersrasterio to xarraygeopandas and rasterio togetherVectors are discrete geometric locations (vertices) that define the shape of a spatial object on the Earth's surface.
Vector data can be stored in lots of formats:
.shp.json.xyz.csvRasters are pixelated/gridded data where each pixel is associated with a specific geographic location.
Rasters can have different resolutions, from metres to >kilometres:
Rasters can have one or more bands. Bands could represent colours in a colour image, or other dimensions of interest (e.g. time).
Rasters can be stored in lots of formats:
.tiff.grd.nc.jpegCRSs can be geopgrahic or projected.
CRSs can be uniquely described by:
British National Grid, EPSG:27700: https://epsg.io/27700.
geopandas to work with vectors¶GeoPandas combines Pandas and Shapely to be able to manage vector data in a Pandas DataFrame.
See https://geopandas.org/. Recommended installation is via Conda: conda install -c conda-forge geopandas.
import geopandas as gpd
# This could be a path to e.g. a Shapefile, GeoJSON file etc.
# Here we're using a dataset provided with GeoPandas
file_path = gpd.datasets.get_path("naturalearth_lowres")
gdf = gpd.read_file(file_path)
gdf
| pop_est | continent | name | iso_a3 | gdp_md_est | geometry | |
|---|---|---|---|---|---|---|
| 0 | 920938 | Oceania | Fiji | FJI | 8374.0 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
| 1 | 53950935 | Africa | Tanzania | TZA | 150600.0 | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
| 2 | 603253 | Africa | W. Sahara | ESH | 906.5 | POLYGON ((-8.66559 27.65643, -8.66512 27.58948... |
| 3 | 35623680 | North America | Canada | CAN | 1674000.0 | MULTIPOLYGON (((-122.84000 49.00000, -122.9742... |
| 4 | 326625791 | North America | United States of America | USA | 18560000.0 | MULTIPOLYGON (((-122.84000 49.00000, -120.0000... |
| ... | ... | ... | ... | ... | ... | ... |
| 172 | 7111024 | Europe | Serbia | SRB | 101800.0 | POLYGON ((18.82982 45.90887, 18.82984 45.90888... |
| 173 | 642550 | Europe | Montenegro | MNE | 10610.0 | POLYGON ((20.07070 42.58863, 19.80161 42.50009... |
| 174 | 1895250 | Europe | Kosovo | -99 | 18490.0 | POLYGON ((20.59025 41.85541, 20.52295 42.21787... |
| 175 | 1218208 | North America | Trinidad and Tobago | TTO | 43570.0 | POLYGON ((-61.68000 10.76000, -61.10500 10.890... |
| 176 | 13026129 | Africa | S. Sudan | SSD | 20880.0 | POLYGON ((30.83385 3.50917, 29.95350 4.17370, ... |
177 rows × 6 columns
gdf.crs
<Geographic 2D CRS: EPSG:4326> Name: WGS 84 Axis Info [ellipsoidal]: - Lat[north]: Geodetic latitude (degree) - Lon[east]: Geodetic longitude (degree) Area of Use: - name: World. - bounds: (-180.0, -90.0, 180.0, 90.0) Datum: World Geodetic System 1984 - Ellipsoid: WGS 84 - Prime Meridian: Greenwich
gdf.plot("pop_est", legend=True)
<AxesSubplot:>
# Filter the dataset where column iso_a3 equals GBR and then plot
gdf[gdf['iso_a3'] == 'GBR'].plot()
<AxesSubplot:>
gdf_wm = gdf.to_crs('EPSG:3857')
gdf_wm.plot()
<AxesSubplot:>
gdf.plot()
<AxesSubplot:>
We can easily measure the area of polygons:
# Create a new column 'area' that is the area of each geometry (country)
gdf_wm['area'] = gdf_wm.area
gdf_wm.sort_values('area', ascending=False)
| pop_est | continent | name | iso_a3 | gdp_md_est | geometry | area | |
|---|---|---|---|---|---|---|---|
| 18 | 142257519 | Europe | Russia | RUS | 3745000.00 | MULTIPOLYGON (((19895609.388 11436139.118, 200... | 8.304514e+13 |
| 3 | 35623680 | North America | Canada | CAN | 1674000.00 | MULTIPOLYGON (((-13674486.249 6274861.394, -13... | 5.216648e+13 |
| 22 | 57713 | North America | Greenland | GRL | 2173.00 | POLYGON ((-5205721.290 17490757.185, -4831982.... | 3.628550e+13 |
| 4 | 326625791 | North America | United States of America | USA | 18560000.00 | MULTIPOLYGON (((-13674486.249 6274861.394, -13... | 2.186228e+13 |
| 139 | 1379302771 | Asia | China | CHN | 21140000.00 | MULTIPOLYGON (((12186724.586 2060702.116, 1209... | 1.497731e+13 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 175 | 1218208 | North America | Trinidad and Tobago | TTO | 43570.00 | POLYGON ((-6866186.192 1204901.071, -6802177.4... | 8.051555e+09 |
| 79 | 4543126 | Asia | Palestine | PSE | 21220.77 | POLYGON ((3940438.428 3696430.498, 3888101.327... | 7.015327e+09 |
| 128 | 594130 | Europe | Luxembourg | LUX | 58740.00 | POLYGON ((672711.849 6468481.737, 694939.873 6... | 5.785823e+09 |
| 160 | 265100 | Asia | N. Cyprus | -99 | 3600.00 | POLYGON ((3643685.108 4182926.430, 3651554.656... | 5.686154e+09 |
| 159 | 4050 | Antarctica | Antarctica | ATA | 810.00 | MULTIPOLYGON (((-5416874.996 -14393907.644, -5... | NaN |
177 rows × 7 columns
We can get the centre of the polygons:
# Create a new column that is the centroid of each geometry (country)
gdf_wm['centroid'] = gdf_wm.centroid
gdf_wm['centroid'].plot()
<AxesSubplot:>
We can calculate distances between these centroids. E.g. say we want to calculate the distance from the UK:
# Get the row that is for the UK
uk_centroid = gdf_wm[gdf_wm['name'] == 'United Kingdom']['centroid'].iloc[0]
# Create a new column and use the distance() method to calculate the distance to the UK centroid
gdf_wm['distance_to_uk'] = gdf_wm['centroid'].distance(uk_centroid)
gdf_wm.sort_values('distance_to_uk', ascending=False)
| pop_est | continent | name | iso_a3 | gdp_md_est | geometry | area | centroid | distance_to_uk | |
|---|---|---|---|---|---|---|---|---|---|
| 136 | 4510327 | Oceania | New Zealand | NZL | 174800.0 | MULTIPOLYGON (((19690839.812 -4875534.642, 196... | 5.005072e+11 | POINT (19211133.527 -5143927.928) | 2.310196e+07 |
| 134 | 279070 | Oceania | New Caledonia | NCL | 10770.0 | POLYGON ((18454544.055 -2401420.887, 18545826.... | 2.686859e+10 | POINT (18427504.310 -2423487.891) | 2.107120e+07 |
| 89 | 282814 | Oceania | Vanuatu | VUT | 723.0 | MULTIPOLYGON (((18614489.182 -1792201.332, 186... | 8.119409e+09 | POINT (18598636.273 -1752068.573) | 2.092808e+07 |
| 0 | 920938 | Oceania | Fiji | FJI | 8374.0 | MULTIPOLYGON (((20037508.343 -1812498.413, 200... | 2.128334e+10 | POINT (18248781.791 -1958098.338) | 2.070252e+07 |
| 135 | 647581 | Oceania | Solomon Is. | SLB | 1198.0 | MULTIPOLYGON (((18047007.277 -1173496.191, 180... | 2.549835e+10 | POINT (17807849.653 -989994.877) | 1.989061e+07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129 | 11491346 | Europe | Belgium | BEL | 508600.0 | POLYGON ((685356.051 6586647.408, 672711.849 6... | 7.485603e+10 | POINT (509597.156 6561035.236) | 1.043256e+06 |
| 130 | 17084719 | Europe | Netherlands | NLD | 870800.0 | POLYGON ((768676.624 7072687.677, 789483.757 7... | 1.067734e+11 | POINT (614065.187 6856791.175) | 9.944941e+05 |
| 133 | 5011102 | Europe | Ireland | IRL | 322000.0 | POLYGON ((-689945.390 7145114.483, -671588.863... | 1.626143e+11 | POINT (-891472.831 7020537.812) | 5.929234e+05 |
| 143 | 64769452 | Europe | United Kingdom | GBR | 2788000.0 | MULTIPOLYGON (((-689945.390 7145114.483, -7740... | 7.223808e+11 | POINT (-323128.917 7189485.698) | 0.000000e+00 |
| 159 | 4050 | Antarctica | Antarctica | ATA | 810.0 | MULTIPOLYGON (((-5416874.996 -14393907.644, -5... | NaN | POINT EMPTY | NaN |
177 rows × 9 columns
Remember this is just a Pandas DataFrame, so we can do all the usual Pandas stuff with the data.
# Get only countries with populations of >50 million and plot on a bar chart
gdf_big = gdf_wm[gdf_wm['pop_est'] > 50000000]
gdf_big = gdf_big.sort_values('distance_to_uk')
gdf_big.plot.bar(x='name', y='distance_to_uk')
<AxesSubplot:xlabel='name'>
gdf_big[['name','geometry','distance_to_uk']].to_file('./data/big_countries.geojson',
driver='GeoJSON')
rasterio to work with rasters¶Rasterio is library that wraps around the ubiquitous GDAL library to give a user-friendly interface for dealing with raster data.
See https://rasterio.readthedocs.io/en/latest/. Recommended installation is via Conda: conda install -c conda-forge rasterio.
import rasterio as rio
rs = rio.open('./data/rainfall_5km_2017.tif')
rs.crs
CRS.from_wkt('PROJCS["unnamed",GEOGCS["OSGB 1936",DATUM["OSGB_1936",SPHEROID["Airy 1830",6377563.396,299.324964600004,AUTHORITY["EPSG","7001"]],AUTHORITY["EPSG","6277"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4277"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",49],PARAMETER["central_meridian",-2],PARAMETER["scale_factor",0.9996012717],PARAMETER["false_easting",400000],PARAMETER["false_northing",-100000],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]')
rs.res
(5000.0, 5000.0)
rs.bounds
BoundingBox(left=0.0, bottom=0.0, right=700000.0, top=1250000.0)
rs.shape
(250, 140)
rs.count
366
Rasterio is really just adding some metadata to NumPy arrays:
arr = rs.read(300, masked=True)
arr
masked_array(
data=[[--, --, --, ..., --, --, --],
[--, --, --, ..., --, --, --],
[--, --, --, ..., --, --, --],
...,
[--, --, --, ..., --, --, --],
[--, --, --, ..., --, --, --],
[--, --, --, ..., --, --, --]],
mask=[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]],
fill_value=-3.4e+38,
dtype=float32)
print(rs.nodata)
arr = rs.read(300)
arr
-3.4e+38
array([[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
...,
[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]],
dtype=float32)
Rasterio has a convert rasterio.plot.show() function, which wraps around Matplotlib:
from rasterio.plot import show, show_hist
show(rs)
<AxesSubplot:>
show_hist(rs.read(300, masked=True))
Rasterio has a fairly extensive CLI capable of performing common useful tasks:
rio insp and rio info are particularly useful for having a quick look at files. For example, to quickly get metadata about a file:
$ rio info rainfall_5km_2015-300.tif --verbose
{"bounds": [0.0, 0.0, 700000.0, 1250000.0], "checksum": [8245], "colorinterp": ["gray"], "compress": "lzw", "count": 1, "crs": "PROJCS[\"unnamed\",GEOGCS[\"Airy 1830\",DATUM[\"unknown\",SPHEROID[\"airy\",6377563.396,299.3249753150345],TOWGS84[446.448,-125.157,542.06,0.15,0.247,0.842,-20.489]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",49],PARAMETER[\"central_meridian\",-2],PARAMETER[\"scale_factor\",0.9996012717],PARAMETER[\"false_easting\",400000],PARAMETER[\"false_northing\",-100000],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]]]", "descriptions": [null], "driver": "GTiff", "dtype": "float32", "height": 250, "indexes": [1], "interleave": "band", "lnglat": [-2.7933739244523172, 55.516181609501864], "mask_flags": [["nodata"]], "nodata": -3.4e+38, "res": [5000.0, 5000.0], "shape": [250, 140], "stats": [{"max": 20.69357681274414, "mean": 3.7167322386771935, "min": 0.0}], "tiled": false, "transform": [5000.0, 0.0, 0.0, 0.0, -5000.0, 1250000.0, 0.0, 0.0, 1.0], "units": [null], "width": 140}
Quickly plotting a file:
$ rio insp rainfall_5km_2015-300.tif
Rasterio 1.0.25 Interactive Inspector (Python 3.7.3)
Type "src.meta", "src.read(1)", or "help(src)" for more information.
>>> rasterio.plot.show(src)
Quickly reprojecting (changing coordinate system) and resampling (changing resolution) is very easy with rio warp:
Change CRS to British National Grid:
$ rio warp input.tif output.tif --dst-crs EPSG:27700
Resample to 1x1 km grid using cubic interpolation:
$ rio warp input.tif output.tif --res 1000 --resampling cubic
Based on another raster:
$ rio warp input.tif output.tif --like template.tif
rasterio with xarray¶A neat little package called rioxarray lets you combine rasterios functionality into xarray, as well as adding an easier-to-use interface to some function (e.g. reprojection, clipping).
xarray: Multidimensional labelled arrays. Introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays.

import xarray
import rioxarray
import pandas as pd
xds = rioxarray.open_rasterio('./data/rainfall_5km_2017.tif', masked=True)
xds
<xarray.DataArray (band: 366, y: 250, x: 140)>
[12810000 values with dtype=float64]
Coordinates:
* band (band) int64 1 2 3 4 5 6 7 8 ... 360 361 362 363 364 365 366
* y (y) float64 1.248e+06 1.242e+06 1.238e+06 ... 7.5e+03 2.5e+03
* x (x) float64 2.5e+03 7.5e+03 1.25e+04 ... 6.925e+05 6.975e+05
spatial_ref int64 0
Attributes:
scale_factor: 1.0
add_offset: 0.0
grid_mapping: spatial_ref[12810000 values with dtype=float64]
array([ 1, 2, 3, ..., 364, 365, 366])
array([1247500., 1242500., 1237500., ..., 12500., 7500., 2500.])
array([ 2500., 7500., 12500., 17500., 22500., 27500., 32500., 37500.,
42500., 47500., 52500., 57500., 62500., 67500., 72500., 77500.,
82500., 87500., 92500., 97500., 102500., 107500., 112500., 117500.,
122500., 127500., 132500., 137500., 142500., 147500., 152500., 157500.,
162500., 167500., 172500., 177500., 182500., 187500., 192500., 197500.,
202500., 207500., 212500., 217500., 222500., 227500., 232500., 237500.,
242500., 247500., 252500., 257500., 262500., 267500., 272500., 277500.,
282500., 287500., 292500., 297500., 302500., 307500., 312500., 317500.,
322500., 327500., 332500., 337500., 342500., 347500., 352500., 357500.,
362500., 367500., 372500., 377500., 382500., 387500., 392500., 397500.,
402500., 407500., 412500., 417500., 422500., 427500., 432500., 437500.,
442500., 447500., 452500., 457500., 462500., 467500., 472500., 477500.,
482500., 487500., 492500., 497500., 502500., 507500., 512500., 517500.,
522500., 527500., 532500., 537500., 542500., 547500., 552500., 557500.,
562500., 567500., 572500., 577500., 582500., 587500., 592500., 597500.,
602500., 607500., 612500., 617500., 622500., 627500., 632500., 637500.,
642500., 647500., 652500., 657500., 662500., 667500., 672500., 677500.,
682500., 687500., 692500., 697500.])array(0)
xds.sel(x=250000, y=250000, method='nearest').plot()
[<matplotlib.lines.Line2D at 0x7f9f0a029490>]
Say we want to clip data for the Thames catchment, for which we have a Shapefile (vector):
gdf_thames = gpd.read_file('./data/thames_catchment_osgb/thames_catchment_osgb.shp')
gdf_thames.plot()
<AxesSubplot:>
clipped = xds.rio.clip(gdf_thames.geometry, gdf_thames.crs)
clipped.sel(band=300).plot()
# Could also do: clipped[299,:,:].plot()
<matplotlib.collections.QuadMesh at 0x7f9f09ff60d0>
clipped.rio.to_raster('./data/thames.tif')
Reprojecting a raster to a different CRS is simple with rioxarray. Say we want to reproject the previous raster to EPSG:3857 (used by e.g. Google Maps), and at the same time resample it from 5x5 km to 1x1 km using bilinear interpolation.
# Rasterio resampling methods available: nearest, bilinear, cubic,
# cubic_spline, lanczos, average, mode, max, min, med, q1, q3
from rasterio.warp import Resampling
clipped_wm = clipped.rio.reproject('EPSG:3857',
resolution=(1000,1000),
resampling=Resampling.bilinear)
clipped_wm.rio.crs
CRS.from_epsg(3857)
clipped_wm.rio.resolution()
(1000.0, -1000.0)
clipped_wm.sel(band=300).plot()
<matplotlib.collections.QuadMesh at 0x7f9efba5c2e0>
Note the x-coordinate origin is now at the Prime Meridian in Greenwich.
Say we have some sampling data at various points around the country, and we want turn this into a raster by interpolating between the points. We can use geopandas, rasterio and rioxarray to do this.
Here I'm using some soil chemistry data from the Thames catchment. Our data looks like this:
gdf = gpd.read_file('./data/thames_k_att_shp/thames_k_att_osgb.shp')
gdf.plot('k_att', legend=True)
<AxesSubplot:>
from rasterio import features
# Base our raster on a template raster (the Thames catchment in this case)
rs_thames = rio.open('./data/template.tif')
meta = rs_thames.meta.copy()
# Open a new file to write the burnt raster to
with rio.open('./data/rasterized.tif', 'w+', **meta) as out:
out_arr = out.read(1)
# Turn the Shapefile into geom, value tuples to pass to the rasterize function
shapes = ((geom, value) for geom, value in zip(gdf.geometry, gdf.k_att))
# Perform the burning using rasterio's rasterize
rasterized = features.rasterize(shapes, out=out_arr, transform=out.transform)
# Write the burnt raster to file
out.write(rasterized, 1)
Having a look at our burnt raster:
with rio.open('./data/rasterized.tif') as src:
show(src)
interpolate_na from rioxarray¶This method uses scipy.interpole.griddata to interpolate missing data by nearest neighbour, linearly or cubically. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.griddata.html.
xds_int = rioxarray.open_rasterio('./data/rasterized.tif')
filled = xds_int.sel(band=1).rio.interpolate_na('nearest')
filled.plot()
<matplotlib.collections.QuadMesh at 0x7f9efb8a76a0>
filled = xds_int.sel(band=1).rio.interpolate_na('cubic')
# Set nodata to NaN
filled = filled.where(filled != filled.rio.nodata)
filled.plot()
<matplotlib.collections.QuadMesh at 0x7f9efb7db130>